Stabilize benchmark signal with variance-aware sampling and rolling baselines#4622

Merged
timotheeguerin merged 7 commits into
mainfrom
copilot/improve-benchmark-accuracy
Jun 30, 2026

Copilot AI commented Jun 12, 2026

Contributor

Benchmark results were overly sensitive to GitHub runner noise, causing high run-to-run deviation and unreliable PR deltas. This change makes benchmark comparisons more stable by increasing sample quality, gating on variance, and comparing against a rolling mainline baseline instead of a single latest run.

  • Sampling and runner stability

    • Benchmark workflow now uses higher-fidelity defaults (warmup=3, iterations=25).
    • Added configurable benchmark runner selection (workflow input + repo variable) to support dedicated/stable runners.
    • Pinned Node version in benchmark workflow setup to reduce environment drift.
  • Variance-aware execution (noise gate)

    • Added per-spec runtime variability statistics (mean/median/stddev/CV/min/max/sample count).
    • Introduced optional noise-gating reruns when total runtime CV exceeds threshold:
      • --noise-cv-threshold
      • --max-reruns
      • --rerun-iterations
    • Runner now records whether reruns were triggered and how many were performed.
  • Rolling baseline for PR comparisons

    • PR comment baseline now prefers a rolling aggregate over recent main history (results/history.json) with fallback to results/latest.json.
    • Added --baseline-window to control rolling window size.
    • Baseline labeling now distinguishes synthetic rolling baselines from commit SHAs.
  • Benchmark output and docs updates

    • Added shared statistics utilities for variability calculations.
    • Updated benchmark summaries/comments to surface variability context.
    • Updated benchmark README and tests to cover new CLI/options and formatting behavior.
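
The variability statistics and the noise gate described above can be sketched roughly as follows. This is a minimal illustration, not the code in `packages/benchmark/src/statistics.ts`: the names `summarize`, `runWithNoiseGate`, and `GateOptions` are hypothetical, the use of the sample (n - 1) standard deviation is an assumption, and so is the choice to append rerun samples to the existing pool rather than replace it.

```typescript
// Hypothetical shapes and names; the PR's real statistics utilities may differ.
interface RuntimeStats {
  mean: number;
  median: number;
  stddev: number;
  cv: number; // coefficient of variation = stddev / mean
  min: number;
  max: number;
  count: number;
}

function summarize(samples: number[]): RuntimeStats {
  if (samples.length === 0) throw new Error("need at least one sample");
  const sorted = [...samples].sort((a, b) => a - b);
  const count = sorted.length;
  const mean = sorted.reduce((sum, x) => sum + x, 0) / count;
  const mid = Math.floor(count / 2);
  const median = count % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  // Assumption: sample standard deviation (n - 1 denominator); 0 for one sample.
  const variance =
    count > 1 ? sorted.reduce((sum, x) => sum + (x - mean) ** 2, 0) / (count - 1) : 0;
  const stddev = Math.sqrt(variance);
  const cv = mean === 0 ? 0 : stddev / mean;
  return { mean, median, stddev, cv, min: sorted[0], max: sorted[count - 1], count };
}

// Mirrors the --noise-cv-threshold, --max-reruns, and --rerun-iterations flags.
interface GateOptions {
  noiseCvThreshold: number;
  maxReruns: number;
  rerunIterations: number;
}

function runWithNoiseGate(measure: () => number, iterations: number, opts: GateOptions) {
  const samples: number[] = [];
  for (let i = 0; i < iterations; i++) samples.push(measure());

  let reruns = 0;
  let stats = summarize(samples);
  // Rerun only while the run is noisier than the threshold and budget remains.
  while (stats.cv > opts.noiseCvThreshold && reruns < opts.maxReruns) {
    for (let i = 0; i < opts.rerunIterations; i++) samples.push(measure());
    reruns++;
    stats = summarize(samples);
  }
  return { stats, reruns, rerunTriggered: reruns > 0 };
}
```

With a threshold of 0.08, a run whose standard deviation is more than 8% of its mean would trigger at most `maxReruns` extra batches, and the result records whether that happened.
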

```bash
node packages/benchmark/dist/src/cli.js run \
  --iterations 25 \
  --warmup 3 \
  --noise-cv-threshold 0.08 \
  --max-reruns 1 \
  --rerun-iterations 10 \
  --output /tmp/benchmark-results.json
```
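
The rolling-baseline selection for PR comparisons could look something like the sketch below. Everything here is an assumption for illustration: the entry shape (`commit`, `totalMs`), the function names, the use of a median as the aggregate, and the requirement of at least two history entries before a rolling baseline is used. The PR only states that a rolling aggregate over `results/history.json` is preferred, with fallback to `results/latest.json`, and that the label distinguishes synthetic baselines from commit SHAs.

```typescript
// Hypothetical entry shape; the real history.json schema is not shown in this PR.
interface BenchmarkEntry {
  commit: string;
  totalMs: number;
}

interface Baseline {
  label: string;
  totalMs: number;
  synthetic: boolean; // true when the baseline is an aggregate, not one commit
}

// history is assumed to be ordered oldest to newest; window mirrors --baseline-window.
function pickBaseline(
  history: BenchmarkEntry[],
  latest: BenchmarkEntry | undefined,
  window: number,
): Baseline | undefined {
  const recent = window > 0 ? history.slice(-window) : [];
  if (recent.length >= 2) {
    const sorted = recent.map((e) => e.totalMs).sort((a, b) => a - b);
    const mid = Math.floor(sorted.length / 2);
    // Assumption: median as the aggregate, since it resists a single noisy run.
    const median =
      sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
    return {
      label: `rolling median of last ${recent.length} main runs`,
      totalMs: median,
      synthetic: true,
    };
  }
  // Fallback: compare against the single latest main result, labeled by short SHA.
  if (latest) {
    return { label: latest.commit.slice(0, 7), totalMs: latest.totalMs, synthetic: false };
  }
  return undefined;
}

// Percent change of the PR run relative to the chosen baseline.
function percentDelta(currentMs: number, baselineMs: number): number {
  return ((currentMs - baselineMs) / baselineMs) * 100;
}
```

The point of the aggregate is that one unusually slow or fast run on main shifts the baseline far less than it would when comparing against that run alone.
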

Copilot AI and others added 2 commits June 11, 2026 21:43
Co-authored-by: timotheeguerin <1031227+timotheeguerin@users.noreply.github.com>
Co-authored-by: timotheeguerin <1031227+timotheeguerin@users.noreply.github.com>
@azure-sdk

Collaborator

No changes needing a change description found.

@azure-sdk

Collaborator

You can try these changes here

🛝 Playground 🌐 Website

@timotheeguerin

Member

@copilot fix format, spellcheck and run benchmark ci

Co-authored-by: timotheeguerin <1031227+timotheeguerin@users.noreply.github.com>

Copilot AI commented Jun 15, 2026

Contributor Author

@copilot fix format, spellcheck and run benchmark ci

Addressed in commit c226b35.

  • Fixed formatting issues (prettier --check now passes).
  • Resolved spellcheck by renaming unflattenRuntime to expandRuntimeMetrics.
  • Ran benchmark CI-equivalent checks:
    • pnpm -r --filter "@azure-tools/typespec-benchmark..." build
    • pnpm --filter @azure-tools/typespec-benchmark test
    • benchmark run command with CI flags completed successfully.

@timotheeguerin

Member

@copilot the benchmark workflow run is still failing

Co-authored-by: timotheeguerin <1031227+timotheeguerin@users.noreply.github.com>

Copilot AI commented Jun 15, 2026

Contributor Author

@copilot the benchmark workflow run is still failing

Fixed in commit 093bcfe.

The root cause was a Node engine mismatch in the benchmark workflow: pnpm install failed with ERR_PNPM_UNSUPPORTED_ENGINE on Node 24.11.1. I updated .github/workflows/benchmark.yml to use Node 24.15.0, which satisfies the dependency engine requirement.

Comment thread packages/benchmark/src/statistics.ts
@timotheeguerin

Member

@copilot the benchmark ran into OOM


  Benchmarking: azure-core-dataplane
    Warmup 1/3...
    Warmup 2/3...
    Warmup 3/3...
    Iteration 1/25...
    Iteration 2/25...
    Iteration 3/25...
    Iteration 4/25...
    Iteration 5/25...
    Iteration 6/25...
    Iteration 7/25...
    Iteration 8/25...

<--- Last few GCs --->

[4813:0x9db3000]   429954 ms: Scavenge (interleaved) 4048.6 (4076.5) -> 4041.3 (4114.7) MB, pooled: 0 MB, 34.43 / 0.00 ms  (average mu = 0.217, current mu = 0.229) allocation failure; 
[4813:0x9db3000]   431946 ms: Mark-Compact (reduce) 4062.8 (4116.2) -> 4047.8 (4061.7) MB, pooled: 0 MB, 145.02 / 0.00 ms  (+ 1717.1 ms in 328 steps since start of marking, biggest step 8.1 ms, walltime since start of marking 1992 ms) (average mu = 0.211,
FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
----- Native stack trace -----

 1: 0x73f8c4 node::OOMErrorHandler(char const*, v8::OOMDetails const&) [node]
 2: 0xc06f90  [node]
 3: 0xc0707f  [node]
 4: 0xeaa885  [node]
 5: 0xeaa8b2  [node]
 6: 0xeaabaa  [node]
 7: 0xebb8aa  [node]
 8: 0xebfc50  [node]
 9: 0x1953f71  [node]
/home/runner/work/_temp/a4a90c35-7d83-4657-bd54-f22379613fdf.sh: line 9:  4813 Aborted                 (core dumped) node packages/benchmark/dist/src/cli.js run --specs-dir packages/benchmark/specs --iterations 25 --warmup 3 --noise-cv-threshold 0.08 --max-reruns 1 --rerun-iterations 10 --commit 6f075c146cd770a97dca3d420e095e7220bdeb57 --output /tmp/benchmark-results.json
Error: Process completed with exit code 134.

Co-authored-by: timotheeguerin <1031227+timotheeguerin@users.noreply.github.com>

Copilot AI commented Jun 24, 2026

Contributor Author

@copilot the benchmark ran into OOM


...

Fixed in 1248489.

I updated the benchmark workflow to run the benchmark CLI with a larger Node heap (--max-old-space-size=6144) for both run and backfill, which addresses the OOM seen during high-iteration runs.
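
To confirm that a heap flag such as `--max-old-space-size` actually took effect, and to watch how heap usage moves between iterations, a small diagnostic along these lines may help. It is a sketch only: the helper names are hypothetical and nothing like this is claimed to exist in the benchmark package. It relies on Node's `v8.getHeapStatistics()` and `process.memoryUsage()`.

```typescript
import process from "node:process";
import { getHeapStatistics } from "node:v8";

const MIB = 1024 * 1024;

// V8's configured heap ceiling; roughly tracks --max-old-space-size plus other spaces.
function heapLimitMiB(): number {
  return Math.round(getHeapStatistics().heap_size_limit / MIB);
}

// Heap currently in use by JavaScript objects in this process.
function heapUsedMiB(): number {
  return Math.round(process.memoryUsage().heapUsed / MIB);
}

// Hypothetical helper: one log line per iteration makes steady growth easy to spot.
function heapLine(label: string): string {
  return `${label}: heap used ${heapUsedMiB()} MiB of ${heapLimitMiB()} MiB limit`;
}

console.log(heapLine("startup"));
```

If the "heap used" figure climbs on every iteration and never drops back, raising the limit only delays the failure.
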

@timotheeguerin

Member

@copilot still OOM. Increasing memory doesn't seem to be the right solution anyway. What is causing this increase in memory use? Is there a memory leak in an emitter or in the benchmark code?

Copilot AI and others added 2 commits June 24, 2026 17:12
Co-authored-by: timotheeguerin <1031227+timotheeguerin@users.noreply.github.com>
Co-authored-by: timotheeguerin <1031227+timotheeguerin@users.noreply.github.com>

Copilot AI commented Jun 24, 2026

Contributor Author

@copilot still OOM. Increasing memory doesn't seem to be the right solution anyway. What is causing this increase in memory use? Is there a memory leak in an emitter or in the benchmark code?

The root cause is memory accumulation in the benchmark process, not a single emitter regression: repeated in-process compile() calls retain heap across iterations and specs, so memory keeps growing until the run OOMs. I fixed this in 3f2a5f9 by isolating each iteration's compile in a short-lived child process, so memory is reclaimed between iterations while the benchmark output and metrics stay the same. I also validated the benchmark build and tests, and reran the high-iteration benchmark command successfully.
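
The per-iteration isolation can be illustrated with the sketch below. It is not the code from the commit: `runIsolatedIteration`, `runIsolated`, and the inline child script are hypothetical, and the child does placeholder allocation work in place of a real `compile()` call. The idea it shows is that each iteration runs in its own Node process, reports its timing as JSON on stdout, and exits, so whatever heap it accumulated is released by the operating system.

```typescript
import { execFileSync } from "node:child_process";
import process from "node:process";

interface IterationResult {
  ms: number; // time measured inside the child, excluding process startup
  heapUsed: number; // child's heap usage in bytes at the end of the work
}

// Hypothetical stand-in for one compile: allocate, time it, print JSON, exit.
const childScript = `
const start = process.hrtime.bigint();
const acc = [];
for (let i = 0; i < 100000; i++) acc.push({ i });
const ms = Number(process.hrtime.bigint() - start) / 1e6;
process.stdout.write(JSON.stringify({ ms, heapUsed: process.memoryUsage().heapUsed }));
`;

// Run one iteration in a fresh Node process and parse the metrics it reports.
function runIsolatedIteration(script: string): IterationResult {
  const stdout = execFileSync(process.execPath, ["-e", script], { encoding: "utf8" });
  return JSON.parse(stdout) as IterationResult;
}

// Each iteration starts from a clean heap, so growth cannot carry over.
function runIsolated(script: string, iterations: number): IterationResult[] {
  const results: IterationResult[] = [];
  for (let i = 0; i < iterations; i++) {
    results.push(runIsolatedIteration(script));
  }
  return results;
}

const results = runIsolated(childScript, 3);
console.log(results.map((r) => `${r.ms.toFixed(2)} ms`).join(", "));
```

The trade-off is process startup cost on every iteration. Timing inside the child, as done here, keeps that startup cost out of the reported measurement.
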

@timotheeguerin timotheeguerin enabled auto-merge June 30, 2026 14:55
@timotheeguerin timotheeguerin added this pull request to the merge queue Jun 30, 2026
Merged via the queue into main with commit 8d517e0 Jun 30, 2026
22 checks passed
@timotheeguerin timotheeguerin deleted the copilot/improve-benchmark-accuracy branch June 30, 2026 15:21

4 participants